Finding Cross-Lingual Spelling Variants
Abstract
Finding term translations as cross-lingual spelling variants on the fly is an important problem for cross-lingual information retrieval (CLIR). CLIR is typically approached by automatically translating a query into the target language; for an overview, see [1]. When the query is translated automatically, specialized terminology is often missing from the translation dictionary. The analysis of query properties in [2] has shown that proper names and technical terms are often prime keys in queries, and if they are not properly translated or transliterated, query performance may deteriorate significantly. As proper names often need no translation, a trivial solution is to include the untranslated keys as such in the target language query. However, technical terms in European languages often have common Greek or Latin roots, which allows for a more advanced solution: using approximate string matching to find the word or words most similar to the source keys in the index of the target language text database [3]. A comparison of methods applied to cross-lingual spelling variants in CLIR for a number of European languages is provided in [4]. The authors compare exact match, simple edit distance, longest common subsequence, digrams, trigrams and tetragrams as well as skipgrams, i.e. digrams with gaps. Skipgrams perform best in their comparison, with a relative improvement of 7.5 % on average over the simple edit distance baseline. They also show that, among the baselines, simple edit distance is in general the hardest to beat. They use no explicit n-gram transformation information. In [5], explicit n-gram transformations are based on digrams and trigrams. Trigrams are better than digrams, but no comparison is made to the edit distance baseline.
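The two kinds of measure compared above can be illustrated with a minimal Python sketch. It is not the implementation used in [3] or [4]: the function names, the set-based skipgram representation, the Dice coefficient, and the toy target index are all assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Simple (Levenshtein) edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def skipgrams(word: str, max_gap: int = 1) -> set[str]:
    """Letter pairs with 0..max_gap skipped letters between them.

    With max_gap=0 this yields ordinary digrams.
    """
    return {word[i] + word[j]
            for i in range(len(word))
            for j in range(i + 1, min(i + 2 + max_gap, len(word)))}


def skipgram_similarity(a: str, b: str, max_gap: int = 1) -> float:
    """Dice coefficient over skipgram sets (an assumed choice of overlap measure).

    The sets ignore where in the word each pair occurs, so this is a
    bag-of-n-grams style measure, unlike edit distance.
    """
    ga, gb = skipgrams(a, max_gap), skipgrams(b, max_gap)
    if not ga and not gb:
        return 1.0 if a == b else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


# Hypothetical source key and a toy stand-in for the target language index.
source_key = "hemoglobin"
target_index = ["hemoglobiini", "hemofilia", "globaali"]

closest_by_edit = min(target_index, key=lambda w: edit_distance(source_key, w))
closest_by_skipgram = max(target_index, key=lambda w: skipgram_similarity(source_key, w))
print(closest_by_edit, closest_by_skipgram)
```

In this toy case both measures pick the same target word. The difference between them only matters for harder pairs, where the position-independent n-gram sets and the order-sensitive edit distance can rank candidates differently.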
In both of the previous studies on European languages, most of the distance measures for finding the closest matching transformations are based on a bag of n-grams, ignoring the order of the n-grams. Between languages with different writing systems, foreign words are often borrowed based on phonetic rather than orthographic transliterations. In [6], a generative model is introduced which transliterates words from Japanese to English using weighted finite-state transducers. The transducer model uses only context-free transliterations, which do not account for the fact that a sound may be spelled differently in different contexts. This is likely to produce heavily overgenerating systems. The first contribution of this work is to show that a distance measure which explicitly accounts for the order of the letter or sound …
Similar resources
Dictionary-independent translation in CLIR between closely related languages
This paper presents results from a study, where fuzzy string matching techniques were used as the sole query translation technique in Cross Language Information Retrieval (CLIR) between the closely related languages Swedish and Norwegian. It is a novel research idea to apply only fuzzy string matching techniques in query translation. Closely related languages share a number of words that are cr...
Translating cross-lingual spelling variants using transformation rules
Technical terms and proper names constitute a major problem in dictionary-based cross-language information retrieval (CLIR). However, technical terms and proper names in different languages often share the same Latin or Greek origin, being thus spelling variants of each other. In this paper we present a novel two-step fuzzy translation technique for cross-lingual spelling variants. In the first ...
When Harry Met Harri: Cross-lingual Name Spelling Normalization
Foreign name translations typically include multiple spelling variants. These variants cause data sparseness problems, increase Out-of-Vocabulary (OOV) rate, and present challenges for machine translation, information extraction and other NLP tasks. This paper aims to identify name spelling variants in the target language using the source name as an anchor. Based on word-to-word translation and ...
Cross-lingual acoustic modeling for dialectal Arabic speech recognition
A major problem with dialectal Arabic acoustic modeling is due to the very sparse available speech resources. In this paper, we have chosen Egyptian Colloquial Arabic (ECA) as a typical dialect. In order to benefit from existing Modern Standard Arabic (MSA) resources, a cross-lingual acoustic modeling approach is proposed that is based on supervised model adaptation. MSA acoustic models were ada...